Interaction Terms (Part 1)

Statistics for Data Science II

Introduction

  • Recall the general linear model,

y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k + \varepsilon

  • Until now, we have talked about models with only main effects.

    • e.g., x_1, x_2
  • Today, we will begin talking about interactions.

    • e.g., x_1 \times x_2

Interactions

  • Recall interactions from two-way ANOVA:

    • The relationship between the outcome and one predictor depends on the level of another predictor.
  • Interactions work (and are specified) the same way in regression.

  • The usual caveats apply:

    • We do not want to load models with too many interactions.

    • We favor simplicity over interactions that do not add much to the predictive power of the model.

    • We do not want higher than two-way interactions unless necessary.

  • Today we will focus on continuous \times continuous interactions.

Interactions

  • We will construct what is called a hierarchical well-formulated (HWF) model.

  • This means that when a higher-order interaction term is included in the model, all lower-order terms are also included.

  • e.g., when a two-way interaction is included, we also include the corresponding main effects.

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2

  • e.g., when a three-way interaction is included, we also include the corresponding main effects and two-way interactions.

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3

Today’s Data

library(tidyverse)
ratings <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-25/ratings.csv')
details <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-25/details.csv')
all <- full_join(ratings, details, by = "id")
head(all, n=5)

Example - Model

m1 <- lm(average ~ year + minplayers + playingtime + minplayers:playingtime, 
         data = all) 
summary(m1)

Call:
lm(formula = average ~ year + minplayers + playingtime + minplayers:playingtime, 
    data = all)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.2060 -0.5489  0.0409  0.6043  3.5575 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             6.112e+00  6.719e-02   90.97   <2e-16 ***
year                    3.643e-04  3.230e-05   11.28   <2e-16 ***
minplayers             -2.249e-01  8.964e-03  -25.09   <2e-16 ***
playingtime             4.719e-04  2.335e-05   20.21   <2e-16 ***
minplayers:playingtime -5.659e-05  3.762e-06  -15.04   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.903 on 21626 degrees of freedom
  (200 observations deleted due to missingness)
Multiple R-squared:  0.05603,   Adjusted R-squared:  0.05586 
F-statistic: 320.9 on 4 and 21626 DF,  p-value: < 2.2e-16

Example - Exploration

  • Wait, maybe we should explore the data first.
quantile(all$playingtime, c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE) 
   0%   25%   50%   75%  100% 
    0    25    45    90 60000 

Example - Exploration

  • How many games take longer than 5 hours (!!) to play?

    • There are 668 observations in this dataset.😳
head(all %>% filter(playingtime > 300), n = 3)

Example - Exploration

  • Let’s keep exploring. What about year?
quantile(all$year, c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE) 
  0%  25%  50%  75% 100% 
   0 2001 2011 2017 3500 

Example - Exploration

  • Let’s keep exploring. What about minimum player count?
all %>% count(minplayers)

Example - Final Data Cleaning

  • We now will perform some data cleaning.

    • Limit playing time to (0, 300]

    • Limit years to (1900, 2024)

    • Limit number of players to [1, \infty)

analytic <- all %>% 
  filter(playingtime <= 300 & # ≤ 300 min
         playingtime > 0 & # > 0 min
         year > 1900 & # filter out games without years
         year < 2024 & # filter out games with too large of years
         minplayers > 0) %>% # require at least 1 player for game
  select(id, name, average, year, playingtime, minplayers) %>%
  na.omit()

Example - Model

  • Let us now update our model,
m2 <- lm(average ~ year + minplayers + playingtime + minplayers:playingtime, 
         data = analytic)
summary(m2)

Call:
lm(formula = average ~ year + minplayers + playingtime + minplayers:playingtime, 
    data = analytic)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6664 -0.4417  0.0458  0.4970  3.5196 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -4.605e+01  8.230e-01 -55.956   <2e-16 ***
year                    2.612e-02  4.092e-04  63.837   <2e-16 ***
minplayers             -1.592e-01  1.209e-02 -13.175   <2e-16 ***
playingtime             5.202e-03  2.755e-04  18.881   <2e-16 ***
minplayers:playingtime  1.217e-04  1.341e-04   0.908    0.364    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7742 on 19904 degrees of freedom
Multiple R-squared:  0.2617,    Adjusted R-squared:  0.2616 
F-statistic:  1764 on 4 and 19904 DF,  p-value: < 2.2e-16

Example - Model Comparison


Call:
lm(formula = average ~ year + minplayers + playingtime + minplayers:playingtime, 
    data = all)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.2060 -0.5489  0.0409  0.6043  3.5575 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             6.112e+00  6.719e-02   90.97   <2e-16 ***
year                    3.643e-04  3.230e-05   11.28   <2e-16 ***
minplayers             -2.249e-01  8.964e-03  -25.09   <2e-16 ***
playingtime             4.719e-04  2.335e-05   20.21   <2e-16 ***
minplayers:playingtime -5.659e-05  3.762e-06  -15.04   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.903 on 21626 degrees of freedom
  (200 observations deleted due to missingness)
Multiple R-squared:  0.05603,   Adjusted R-squared:  0.05586 
F-statistic: 320.9 on 4 and 21626 DF,  p-value: < 2.2e-16

Call:
lm(formula = average ~ year + minplayers + playingtime + minplayers:playingtime, 
    data = analytic)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6664 -0.4417  0.0458  0.4970  3.5196 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -4.605e+01  8.230e-01 -55.956   <2e-16 ***
year                    2.612e-02  4.092e-04  63.837   <2e-16 ***
minplayers             -1.592e-01  1.209e-02 -13.175   <2e-16 ***
playingtime             5.202e-03  2.755e-04  18.881   <2e-16 ***
minplayers:playingtime  1.217e-04  1.341e-04   0.908    0.364    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7742 on 19904 degrees of freedom
Multiple R-squared:  0.2617,    Adjusted R-squared:  0.2616 
F-statistic:  1764 on 4 and 19904 DF,  p-value: < 2.2e-16

Example - Model

  • Suppose we weren’t actually interested in year as a predictor. Then,
m3 <- lm(average ~ playingtime + minplayers + playingtime:minplayers, 
         data = analytic)
summary(m3)

Call:
lm(formula = average ~ playingtime + minplayers + playingtime:minplayers, 
    data = analytic)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5089 -0.5153  0.0523  0.5703  3.1715 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             6.4594510  0.0284545 227.010  < 2e-16 ***
playingtime             0.0053204  0.0003024  17.595  < 2e-16 ***
minplayers             -0.1682568  0.0132652 -12.684  < 2e-16 ***
playingtime:minplayers -0.0004002  0.0001469  -2.725  0.00644 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8497 on 19905 degrees of freedom
Multiple R-squared:  0.1106,    Adjusted R-squared:  0.1104 
F-statistic: 824.8 on 3 and 19905 DF,  p-value: < 2.2e-16

Example - Model

  • I personally do not want to interpret playing time in terms of a 1 minute increase, so I will scale it by 60 minutes (so, 1 hour increments)
analytic <- analytic %>% 
  mutate(play60 = playingtime/60)
m4 <- lm(average ~ minplayers + play60 + minplayers:play60, 
         data = analytic)
summary(m4)

Call:
lm(formula = average ~ minplayers + play60 + minplayers:play60, 
    data = analytic)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5089 -0.5153  0.0523  0.5703  3.1715 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        6.459451   0.028454 227.010  < 2e-16 ***
minplayers        -0.168257   0.013265 -12.684  < 2e-16 ***
play60             0.319225   0.018143  17.595  < 2e-16 ***
minplayers:play60 -0.024012   0.008812  -2.725  0.00644 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8497 on 19905 degrees of freedom
Multiple R-squared:  0.1106,    Adjusted R-squared:  0.1104 
F-statistic: 824.8 on 3 and 19905 DF,  p-value: < 2.2e-16

Example - Model

  • Thus, our model is as follows,
coefficients(m4)
      (Intercept)        minplayers            play60 minplayers:play60 
       6.45945102       -0.16825678        0.31922456       -0.02401235 

\hat{\text{rating}} = 6.46 -0.17 \text{ min. players} + 0.32 \text{ time} - 0.02 \text{ min. players $\times$ time}

  • … how do we interpret the interaction?

  • Note that we cannot interpret the main effects of the interaction term

    • i.e., we cannot interpret minimum player count and playing time alone.

Example - Model Dissection

  • If we look at the interaction term,
coefficients(m4)["minplayers:play60"]
minplayers:play60 
      -0.02401235 
  • One general interpretation is that as playing time increases, the slope describing the relationship between average rating and minimum player count decreases.

  • The other general interpretation is that as minimum player count increases, the slope describing the relationship between average rating and playing time decreases.

Example - Simplifying Model

  • For visualization purposes, we will plug in values and reduce down.
  • Let’s plug in for a minimum player count of 2.

\begin{align*} \hat{\text{rating}} &= 6.46 -0.17 \text{ min. players} + 0.32 \text{ time} - 0.02 \text{ min. players $\times$ time} \\ &= 6.46 -0.17(2) + 0.32 \text{ time} - 0.02 (2) \text{ time} \\ &= 6.46 -0.34 + 0.32 \text{ time} - 0.04 \text{ time} \\ &= (6.46 -0.34) + (0.32 - 0.04) \text{ time} \\ &= 6.12 + 0.28 \text{ time} \end{align*}

  • Let’s plug in for a minimum player count of 4.

\begin{align*} \hat{\text{rating}} &= 6.46 -0.17 \text{ min. players} + 0.32 \text{ time} - 0.02 \text{ min. players $\times$ time} \\ &= 6.46 -0.17(4) + 0.32 \text{ time} - 0.02 (4) \text{ time} \\ &= 6.46 -0.68 + 0.32 \text{ time} - 0.08 \text{ time} \\ &= (6.46 -0.68) + (0.32 - 0.08) \text{ time} \\ &= 5.78 + 0.24 \text{ time} \end{align*}

  • Thus, we have the two models,

\begin{align*} \hat{\text{rating}}_{\text{player = 2}} &= 6.12 + 0.28 \text{ time} \\ \hat{\text{rating}}_{\text{player = 4}} &= 5.78 + 0.24 \text{ time} \end{align*}

Interpretations

  • Interaction terms are tricky with interpretations.

  • To interpret the interaction, remember that the relationship between the outcome and one predictor depends on the level of a second predictor.

    • The interaction term itself gives you an idea of what’s happening.

    • We can also provide proper interpretations once we simplify the model.

\hat{\text{rating}} = 6.46 -0.17 \text{ min. players} + 0.32 \text{ time} - 0.02 \text{ min. players $\times$ time}

  • As minimum player count increases, the slope describing the relationship between average rating and playing time decreases by 0.02 rating points.

\begin{align*} \hat{\text{rating}}_{\text{player = 2}} &= 6.12 + 0.28 \text{ time} \\ \hat{\text{rating}}_{\text{player = 4}} &= 5.78 + 0.24 \text{ time} \end{align*}

  • When looking at a 2 player game, for every hour increase in time-to-play, the game’s rating increases by 0.28 rating points.

  • When looking at a 4 player game, for every hour increase in time-to-play, the game’s rating increases by 0.24 rating points.

Example - Predicted Values

  • We want to create a visualization for our model.

    • We have “taken care of” the interaction by plugging in values for minimum player count.
  • We need to decide what goes on the x-axis.

    • In this (very simplistic) example, we only have one variable - playing time in hours.

    • In other, more complicated models, we will have to choose one to allow to vary (and include it on the x-axis), then create predicted values while holding everything else in the model constant.

Example - Model Visualization

analytic <- analytic %>%
  mutate(rating_2 = 6.12 + 0.28*play60,
         rating_4 = 5.78 + 0.24*play60)
analytic %>% 
  ggplot(aes(x = play60, y = average, color = as.factor(minplayers))) +
  geom_point() +
  geom_line(aes(y = rating_2), color = "hotpink") +
  geom_line(aes(y = rating_4), color = "purple") +
  labs(x = "Playing Time (Hours)",
       y = "Average Rating on BGG",
       color = "Minimum Number of Players") +
  theme_bw()

Wrap Up

  • We covered the simplest case of interactions in this lecture, continuous \times continuous.

  • Next lecture, we will continue talking about interactions, including when categorical predictors are involved.